
Mathematics is hard for mathematicians to understand too

Science

At a recent conference on mathematics in the age of automated proofs, mathematician and Fields Medalist Akshay Venkatesh presented “How do we talk to our students about AI?” He quoted an email he'd received from a young student who asked, “Do you believe that mathematics is worth being studied in a world in which a machine can answer everything for you? What do you believe would be the ‘job’ of a mathematician in this world?” Venkatesh framed AI as an opportunity to correct what he called an “essential gap that has opened between the practice of mathematics and our values.” Mathematician William Thurston has explained these values by writing, “mathematics is not about numbers, equations, computations, or algorithms: it is about understanding.” But Venkatesh argued that the record on this is terrible, lamenting that “for a typical paper or talk, very few of us understand it.” He is not alone in thinking that something is wrong with the current state of mathematics research.


Are Today's LLMs Ready to Explain Well-Being Concepts?

Jiang, Bohan, Li, Dawei, Tan, Zhen, Zhao, Chengshuai, Liu, Huan

arXiv.org Artificial Intelligence

Well-being encompasses mental, physical, and social dimensions essential to personal growth and informed life decisions. As individuals increasingly consult Large Language Models (LLMs) to understand well-being, a key challenge emerges: Can LLMs generate explanations that are not only accurate but also tailored to diverse audiences? High-quality explanations require both factual correctness and the ability to meet the expectations of users with varying expertise. In this work, we construct a large-scale dataset comprising 43,880 explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We introduce a principle-guided LLM-as-a-judge evaluation framework, employing dual judges to assess explanation quality. Furthermore, we show that fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) can significantly enhance the quality of generated explanations. Our results reveal that: (1) the proposed LLM judges align well with human evaluations; (2) explanation quality varies significantly across models, audiences, and categories; and (3) DPO- and SFT-finetuned models outperform their larger counterparts, demonstrating the effectiveness of preference-based learning for specialized explanation tasks.


NLP Meets the World: Toward Improving Conversations With the Public About Natural Language Processing Research

Wilson, Shomir

arXiv.org Artificial Intelligence

Recent developments in large language models (LLMs) have been accompanied by rapidly growing public interest in natural language processing (NLP). This attention is reflected by major news venues, which sometimes invite NLP researchers to share their knowledge and views with a wide audience. Recognizing the opportunities of the present, for both the research field and for individual researchers, this paper shares recommendations for communicating with a general audience about the capabilities and limitations of NLP. These recommendations cover three themes: vague terminology as an obstacle to public understanding, unreasonable expectations as obstacles to sustainable growth, and ethical failures as obstacles to continued support. Published NLP research and popular news coverage are cited to illustrate these themes with examples. The recommendations promote effective, transparent communication with the general public about NLP, in order to strengthen public understanding and encourage support for research.


Misalignments in AI Perception: Quantitative Findings and Visual Mapping of How Experts and the Public Differ in Expectations and Risks, Benefits, and Value Judgments

Brauner, Philipp, Glawe, Felix, Liehner, Gian Luca, Vervier, Luisa, Ziefle, Martina

arXiv.org Artificial Intelligence

Artificial Intelligence (AI) is transforming diverse societal domains, raising critical questions about its risks and benefits and the misalignments between public expectations and academic visions. This study examines how the general public (N=1110) -- people using or being affected by AI -- and academic AI experts (N=119) -- people shaping AI development -- perceive AI's capabilities and impact across 71 scenarios, including sustainability, healthcare, job performance, societal divides, art, and warfare. Participants evaluated each scenario on four dimensions: expected probability, perceived risk and benefit, and overall sentiment (or value). The findings reveal significant quantitative differences: experts anticipate higher probabilities, perceive lower risks, report greater utility, and express more favorable sentiment toward AI compared to the non-experts. Notably, risk-benefit tradeoffs differ: the public assigns risk half the weight of benefits, while experts assign it only a third. Visual maps of these evaluations highlight areas of convergence and divergence, identifying potential sources of public concern. These insights offer actionable guidance for researchers and policymakers to align AI development with societal values, fostering public trust and informed governance.


UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models

Yang, Yuzhe, Zhang, Yifei, Hu, Yan, Guo, Yilin, Gan, Ruoli, He, Yueru, Lei, Mingcong, Zhang, Xiao, Wang, Haining, Xie, Qianqian, Huang, Jimin, Yu, Honghai, Wang, Benyou

arXiv.org Artificial Intelligence

This paper introduces UCFE, the User-Centric Financial Expertise benchmark, an innovative framework designed to evaluate the ability of large language models (LLMs) to handle complex real-world financial tasks. The UCFE benchmark adopts a hybrid approach that combines human expert evaluations with dynamic, task-specific interactions to simulate the complexities of evolving financial scenarios. First, we conducted a user study involving 804 participants, collecting their feedback on financial tasks. Second, based on this feedback, we created a dataset that encompasses a wide range of user intents and interactions. This dataset serves as the foundation for benchmarking 12 LLM services using the LLM-as-Judge methodology. Our results show a significant alignment between benchmark scores and human preferences, with a Pearson correlation coefficient of 0.78, confirming the effectiveness of the UCFE dataset and our evaluation approach. The UCFE benchmark not only reveals the potential of LLMs in the financial sector but also provides a robust framework for assessing their performance and user satisfaction. The benchmark dataset and evaluation code are available.


The Foundation Model Transparency Index

Bommasani, Rishi, Klyman, Kevin, Longpre, Shayne, Kapoor, Sayash, Maslej, Nestor, Xiong, Betty, Zhang, Daniel, Liang, Percy

arXiv.org Artificial Intelligence

Foundation models have rapidly permeated society, catalyzing a wave of generative AI applications spanning enterprise and consumer-facing contexts. While the societal impact of foundation models is growing, transparency is on the decline, mirroring the opacity that has plagued past digital technologies (e.g. social media). Reversing this trend is essential: transparency is a vital precondition for public accountability, scientific innovation, and effective governance. To assess the transparency of the foundation model ecosystem and help improve transparency over time, we introduce the Foundation Model Transparency Index. The Foundation Model Transparency Index specifies 100 fine-grained indicators that comprehensively codify transparency for foundation models, spanning the upstream resources used to build a foundation model (e.g. data, labor, compute), details about the model itself (e.g. size, capabilities, risks), and the downstream use (e.g. distribution channels, usage policies, affected geographies). We score 10 major foundation model developers (e.g. OpenAI, Google, Meta) against the 100 indicators to assess their transparency. To facilitate and standardize assessment, we score developers in relation to their practices for their flagship foundation model (e.g. GPT-4 for OpenAI, PaLM 2 for Google, Llama 2 for Meta). We present 10 top-level findings about the foundation model ecosystem: for example, no developer currently discloses significant information about the downstream impact of its flagship model, such as the number of users, affected market sectors, or how users can seek redress for harm. Overall, the Foundation Model Transparency Index establishes the level of transparency today to drive progress on foundation model governance via industry standards and regulatory intervention.


Baidu receives green light to launch AI Ernie Bot for general public, leading China's AI revolution

FOX News

Tech giant Baidu on Wednesday received approval from Chinese authorities to launch its artificial intelligence Ernie Bot to the general public starting Aug. 31, a spokesperson told Reuters. Baidu became the first company to receive such approval after regulatory setbacks and is also set to launch a suite of new AI-native apps. The company has been embedding Ernie, which resembles OpenAI's ChatGPT, into its search engine and other products, allowing many of them to gain market share while waiting for Chinese regulators' approval.


Chinese ChatGPT alternatives just got approved for the general public

MIT Technology Review

When Ernie Bot was released on March 16, the response was a mix of excitement and disappointment. Many people deemed its performance mediocre relative to the previously released ChatGPT. But most people simply weren't able to see it for themselves. The launch event didn't feature a live demonstration, and later, to actually try out the bot, Chinese users needed to have a Baidu account and apply for a use license that could take as long as three months to come through. Because of this, some people who got access early were selling secondhand Baidu accounts on e-commerce sites, charging anywhere from a few bucks to over $100.


Industrial Memories: Exploring the Findings of Government Inquiries with Neural Word Embedding and Machine Learning

Leavy, Susan, Pine, Emilie, Keane, Mark T

arXiv.org Artificial Intelligence

We present a text mining system to support the exploration of large volumes of text detailing the findings of government inquiries. Despite their historical significance and potential societal impact, key findings of inquiries are often hidden within lengthy documents and remain inaccessible to the general public. We transform the findings of the Irish government's inquiry into industrial schools and, through the use of word embeddings, text classification, and visualization, present an interactive web-based platform that enables exploration of the text to uncover new historical insights.


Google launches new AI PaLM 2 in attempt to regain leadership of the pack

The Guardian

Google is attempting to reclaim its crown as the leader in artificial intelligence with PaLM 2, a "next-generation language model" that the company says outperforms other leading systems on some tasks. Revealing the cutting-edge AI at its annual I/O conference, alongside a foldable Pixel phone and a new tablet, Google said it would be built into 25 new products and features, as the company races to catch up with competitors after years of producing AI research but few products. Like other "large language models" such as OpenAI's GPT, PaLM 2 is a general-purpose AI model, which can be used to power ChatGPT-style chatbots but also to translate between languages, write computer code, or even analyse and respond to images. Combining those capabilities, a user could ask a question in English about a restaurant in Bulgaria, and the system would be able to search the web for Bulgarian responses, find an answer, translate the answer into English, add a picture of the location, and then follow up with a code snippet to create a database entry for the place. "The neural network revolution that we are now experiencing started around 10 years ago," said Slav Petrov, the co-lead of the PaLM 2 project, "and it started in part at Google."